Cream of the Crop 11

Version Twenty Seven, Archive Comparison Table [30 December 1995]
[ACT\CALGARY.SET]

The following files were used in the Calgary/Canterbury text compression
corpus test. For more details see below.

Name        Size   Description
---------------------------------------------------------------------------
BIB      111,261   Bibliographic files (refer format)
BOOK1    768,771   Hardy: Far from the madding crowd
BOOK2    610,856   Witten: Principles of computer speech
GEO      102,400   Geophysical data
NEWS     377,109   News batch file
OBJ1      21,504   Compiled code for Vax: compilation of progp
OBJ2     246,814   Compiled code for Apple Macintosh: Knowledge support
                   system
PAPER1    53,161   Witten, Neal and Cleary: Arithmetic coding for data
                   compression
PAPER2    82,199   Witten: Computer (in)security
PAPER3    46,526   Witten: In search of "autonomy"
PAPER4    13,286   Cleary: Programming by example revisited
PAPER5    11,954   Cleary: A logical implementation of arithmetic
PAPER6    38,105   Cleary: Compact hash tables using bidirectional linear
                   probing
PIC      513,216   Picture number 5 from the CCITT Facsimile test files
                   (text + drawings)
PROGC     39,611   C source code: compress version 4.0
PROGL     71,646   Lisp source code: system software
PROGP     49,379   Pascal source code: prediction by partial matching
                   evaluation program
TRANS     93,695   Transcript of a session on a terminal
---------------------------------------------------------------------------
18 Files, 3,251,493 bytes in total size, but actually takes up
3,325,952 bytes, due to file slack (2%).

                          *** More Details ***

This corpus is used in the book

    Bell, T.C., Cleary, J.G. and Witten, I.H. Text compression.
    Prentice Hall, Englewood Cliffs, NJ, 1990

and in the survey paper

    Bell, T.C., Witten, I.H. and Cleary, J.G. "Modeling for text
    compression," Computing Surveys 21(4): 557-591; December 1989,

to evaluate the practical performance of various text compression
schemes. Several other researchers are now using the corpus to evaluate
text compression schemes.
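
As a side check on the size table, the two byte totals quoted under it can be
reproduced with a few lines of Python. This is only a sketch: the source does
not say which cluster size was in use, so the 8,192-byte value below is an
assumption, chosen because rounding every file up to a multiple of it happens
to give the quoted on-disk figure. The names SIZES, CLUSTER and allocated_size
are made up for this illustration.

```python
# File sizes in bytes, copied from the corpus table.
SIZES = {
    "BIB": 111261, "BOOK1": 768771, "BOOK2": 610856, "GEO": 102400,
    "NEWS": 377109, "OBJ1": 21504, "OBJ2": 246814, "PAPER1": 53161,
    "PAPER2": 82199, "PAPER3": 46526, "PAPER4": 13286, "PAPER5": 11954,
    "PAPER6": 38105, "PIC": 513216, "PROGC": 39611, "PROGL": 71646,
    "PROGP": 49379, "TRANS": 93695,
}

# ASSUMPTION: allocation unit of the disk the table was measured on.
# The source only reports the resulting total, not the cluster size.
CLUSTER = 8192


def allocated_size(size: int, cluster: int) -> int:
    """Round a file size up to a whole number of clusters."""
    return -(-size // cluster) * cluster  # integer ceiling division


total = sum(SIZES.values())
on_disk = sum(allocated_size(s, CLUSTER) for s in SIZES.values())
slack = on_disk - total

print(f"{len(SIZES)} files, {total:,} bytes of data")
print(f"{on_disk:,} bytes allocated with {CLUSTER}-byte clusters")
print(f"{slack:,} bytes of slack ({100 * slack / on_disk:.1f}% of allocation)")
```

Trying other values for CLUSTER shows how strongly slack depends on the
allocation unit when a set contains several small files.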
Nine different types of text are represented, and to confirm that the
performance of schemes is consistent for any given type, many of the
types have more than one representative. Normal English, both fiction
and non-fiction, is represented by two books and six papers (labeled
book1, book2, paper1, paper2, paper3, paper4, paper5, paper6). More
unusual styles of English writing are found in a bibliography (bib) and
a batch of unedited news articles (news). Three computer programs
represent artificial languages (progc, progl, progp). A transcript of a
terminal session (trans) is included to indicate the increase in speed
that could be achieved by applying compression to a slow line to a
terminal.

All of the files mentioned so far use ASCII encoding. Some non-ASCII
files are also included: two files of executable code (obj1, obj2), some
geophysical data (geo), and a bit-map black and white picture (pic). The
file geo is particularly difficult to compress because it contains a
wide range of data values, while the file pic is highly compressible
because of large amounts of white space in the picture, represented by
long runs of zeros.

More details of the individual texts are given in the book mentioned
above. Both book and paper give the results of compression experiments
on these texts.

The corpus itself consists of the files bib, book1, book2, geo, news,
obj1, obj2, paper1, paper2, paper3, paper4, paper5, paper6, pic, progc,
progl, progp and trans. (The book and paper above do not give results
for files paper3, paper4, paper5 or paper6.) The directory "index"
contains the sizes of the files and some information about where they
came from.

Ian H. Witten                      Timothy C. Bell
Computer Science Department        Computer Science Department
University of Calgary              University of Canterbury
Calgary T2N 1N4, Canada            Christchurch 1, New Zealand
Phone (403) 220-6780               Phone (64-3) 642352
email: ian@cpsc.UCalgary.CA        email: tim@cosc.canterbury.ac.nz
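
The remark above about pic and geo (long runs of zeros compress well, a wide
range of data values does not) can be illustrated with a small experiment.
The sketch below does not read the corpus files: it builds two synthetic
buffers as rough stand-ins, one all zeros and one of pseudo-random bytes,
which is a far more extreme case than real geophysical data. It uses Python's
zlib module (DEFLATE) purely for convenience; that is not one of the schemes
evaluated in the book or paper, and the names N, zeros, noisy and ratio are
hypothetical.

```python
import random
import zlib

N = 100_000  # buffer length in bytes, arbitrary

# Stand-in for white space in a bitmap: one long run of zero bytes.
zeros = bytes(N)

# Stand-in for data spread over a wide range of values: pseudo-random
# bytes from a fixed seed, so the run is repeatable.
rng = random.Random(0)
noisy = bytes(rng.getrandbits(8) for _ in range(N))


def ratio(data: bytes) -> float:
    """Compressed size divided by original size (smaller is better)."""
    return len(zlib.compress(data, 9)) / len(data)


print(f"zeros: {ratio(zeros):.4f} of original size")
print(f"noisy: {ratio(noisy):.4f} of original size")
```

The zero-filled buffer shrinks to a tiny fraction of its size, while the
pseudo-random one does not shrink at all; the real pic and geo files can be
expected to fall between these two extremes.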